In this lab, we’ll explore the basics of spatial and geometry operations on vector data in R using the sf package. The following materials are modified from Chapter 4 and Chapter 5of Geocomputation with R by Rovin Lovelace.
# Looking at initial datasets to be used
nz
## Simple feature collection with 16 features and 6 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: 1090144 ymin: 4748537 xmax: 2089533 ymax: 6191874
## Projected CRS: NZGD2000 / New Zealand Transverse Mercator 2000
## First 10 features:
## Name Island Land_area Population Median_income Sex_ratio
## 1 Northland North 12500.561 175500 23400 0.9424532
## 2 Auckland North 4941.573 1657200 29600 0.9442858
## 3 Waikato North 23900.036 460100 27900 0.9520500
## 4 Bay of Plenty North 12071.145 299900 26200 0.9280391
## 5 Gisborne North 8385.827 48500 24400 0.9349734
## 6 Hawke's Bay North 14137.524 164000 26100 0.9238375
## 7 Taranaki North 7254.480 118000 29100 0.9569363
## 8 Manawatu-Wanganui North 22220.608 234500 25000 0.9387734
## 9 Wellington North 8048.553 513900 32700 0.9335524
## 10 West Coast South 23245.456 32400 26900 1.0139072
## geom
## 1 MULTIPOLYGON (((1745493 600...
## 2 MULTIPOLYGON (((1803822 590...
## 3 MULTIPOLYGON (((1860345 585...
## 4 MULTIPOLYGON (((2049387 583...
## 5 MULTIPOLYGON (((2024489 567...
## 6 MULTIPOLYGON (((2024489 567...
## 7 MULTIPOLYGON (((1740438 571...
## 8 MULTIPOLYGON (((1866732 566...
## 9 MULTIPOLYGON (((1881590 548...
## 10 MULTIPOLYGON (((1557042 531...
nz_height
## Simple feature collection with 101 features and 2 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: 1204143 ymin: 5048309 xmax: 1822492 ymax: 5650492
## Projected CRS: NZGD2000 / New Zealand Transverse Mercator 2000
## First 10 features:
## t50_fid elevation geometry
## 1 2353944 2723 POINT (1204143 5049971)
## 2 2354404 2820 POINT (1234725 5048309)
## 3 2354405 2830 POINT (1235915 5048745)
## 4 2369113 3033 POINT (1259702 5076570)
## 5 2362630 2749 POINT (1378170 5158491)
## 6 2362814 2822 POINT (1389460 5168749)
## 7 2362817 2778 POINT (1390166 5169466)
## 8 2363991 3004 POINT (1372357 5172729)
## 9 2363993 3114 POINT (1372062 5173236)
## 10 2363994 2882 POINT (1372810 5173419)
Spatial subsetting is the process of converting a spatial object into a new object containing only the features that relate in space to another object. This is analogous the attribute subsetting that we covered last week. There are many ways to spatially subset in R, so we will explore a few.
Let’s start by going back to the New Zealand datasets and find all the high points in the state of Canterbury.
#filtering to only select canterbury
canterbury <- nz %>%
filter(Name == "Canterbury")
#this gives us all the points that fall inside of canterbury
c_height <- nz_height[canterbury,]
#making a map to check
tm_shape(nz) +
tm_polygons() +
tm_shape(nz_height)+
tm_dots(col = "red")
#only mapping the points that fall into canterbury
tm_shape(nz) +
tm_polygons() +
tm_shape(c_height) +
tm_dots(col = "red")
# The default is to subset to features that intersect, but we can use other operations, including finding features that do not intersect.
#creating for opposite, using st_dis
outside_height <- nz_height[canterbury, , op = st_disjoint]
#map it
tm_shape(nz) +
tm_polygons() +
tm_shape(outside_height) +
tm_dots(col = "red")
#We can perform the same operations using topological operators. These operators return matrices testing the relationship between features.
#st_intersects (if it touches at all, include it)
sel_sgbp <- st_intersects(x = nz_height, y = canterbury)
#turning this into a logical (anything greater than 0 are the rows we want because that means they intersect it)
sel_logical <- lengths(sel_sgbp) > 0
c_height2 <-nz_height[sel_logical, ]
tm_shape(nz) +
tm_polygons() +
tm_shape(c_height2) +
tm_dots(col = "red")
Points in Canterbury Again We can also use the st_filter function in sf
#Can do the same thing with tidy commands using st_filter
c_height3 <- nz_height %>%
st_filter(y=canterbury, .predicate = st_intersects) #filter using a spatial intersection
tm_shape(nz) +
tm_polygons() +
tm_shape(c_height3) +
tm_dots(col = "red")
We can change the predicate option to test subset to features that don’t intersect
#We can change the predicate option to test subset to features that don’t intersect
outside_height10 <- nz_height %>%
st_filter(y=canterbury, .predicate = st_disjoint)
tm_shape(nz) +
tm_polygons() +
tm_shape(outside_height10) +
tm_dots(col = "red")
Where attribute joining depends on both data sets sharing a ‘key’ variable, spatial joining uses the same concept but depends on spatial relationships between data sets.
Let’s test this out by creating 50 points randomly distributed across the world and finding out what countries they call in.
#creating bounding box
bb <- st_bbox(world)
#setting random points within the dataframe
random_df <- data.frame(
x=runif(n=10,min = bb[1], max = bb[3]),
y=runif(n=10,min = bb[2], max = bb[4])
)
#turning a data frame into an sf object
random_points <- random_df %>%
st_as_sf(coords = c("x", "y")) %>%
st_set_crs("EPSG:4326")
#checking that random points is now an sf object
class(random_points)
## [1] "sf" "data.frame"
#mapping this
tm_shape(world) +
tm_fill() +
tm_shape(random_points) +
tm_dots(col = "red")
#figuring out what countries have the points in them
world_random <- world[random_points, ]
tm_shape(world_random) +
tm_fill() #just filling in shapes that have points in them
#Practicing Using World Random again
#Let’s first use spatial subsetting to find just the countries that contain random points.
bb <- st_bbox(world)
random_df <- data.frame(
x = runif(n = 10, min = bb[1], max = bb[3]), #bounding box for the x is 1 and 3
y = runif(n=10, min = bb[2], max = bb[4]) #bounding box for the y is 2 and 4
) #this will create a datframe with 10 values with x and y objects
world_random <- world[random_points,] #this is how to spatially subset using the rows which align in each data frame provided
random_points <- random_df %>%
st_as_sf(coords = c("x", "y")) %>% #c with the name of the x and y values
st_set_crs("EPSG:4326") #setting the coordinate reference system
tm_shape(world) +
tm_fill() +
tm_shape(random_points) +
tm_dots(col = "red")
#Now let’s perform a spatial join to add the info from each country that a point falls into onto the point dataset.
random_joined <- st_join(random_points, world) #joining points that fall inside the countries
#By default, st_join performs a left join. We change this and instead perform an inner join.
random_joined_inner <- st_join(random_points, world, left = FALSE) #joining points that fall outside of the countries
Sometimes we might want join geographic datasets that are strongly related, but do not have overlapping geometries. To demonstrate this, let’s look at data on cycle hire points in London.
#Non-overlapping joins
tmap_mode("view")
## tmap mode set to interactive viewing
tm_shape(cycle_hire) +
tm_dots(col = "blue", alpha = 0.5) + #alpha is changing the transparency
tm_shape(cycle_hire_osm) +
tm_dots(col = "red", alpha = 0.5)
#We can check if any of these points overlap.
any(st_touches(cycle_hire, cycle_hire_osm, sparse = FALSE)) #none of these points actually touch
## [1] FALSE
#Let’s say we need to join the ‘capacity’ variable in ‘cycle_hire_osm’ onto the official ‘target’ data in ‘cycle_hire’. The simplest method is using the topological operator st_is_within_distance().
sel <- st_is_within_distance(cycle_hire, cycle_hire_osm, dist = 20)
summary(lengths(sel) > 0) #summarizes the number of points within 20 meters
## Mode FALSE TRUE
## logical 304 438
#Now, we’d like to add the values from ‘cycle_hire_osm’ onto the ‘cycle_hire’ points.
z <- st_join(cycle_hire, cycle_hire_osm, st_is_within_distance, dist = 20)
nrow(cycle_hire) #742 rows in original data
## [1] 742
nrow(z) #762 rows in new data
## [1] 762
#Note: the number of rows of the join is larger than the number of rows in the original dataset. Why? Because some points in ‘cycle_hire’ were within 20 meters of multiple points in ‘cycle_hire_osm’. If we wanted to aggregate so we have just one value per original point, we can use the aggregation methods from last week.
z <- z %>%
group_by(id) %>%
summarize(capacity = mean(capacity))
nrow(z) #now they are the same length
## [1] 742
Similar to attribute data aggregation, spatial data aggregation condenses data (we end up with fewer rows than we started with).
Let’s say we wanted to find the average height of high point in each region of New Zealand. We could use the aggregate function in base R.
nz_agg <- aggregate(x = nz_height, by = nz, FUN = mean)
#x object is the information we want to aggregate, and we want to aggregate it by the y value, then the FUN indicates the function we want to use to aggregate
The result of this is an object with the same geometries as the aggregating feature data set (in this case ‘nz’).
We could also use a sf/dplyr approach.
nz_agg <- st_join(nz, nz_height) %>% #joining onto the aggregating dataset so nz is on the left (because that is what we are joining onto)
group_by(Name) %>%
summarize(elevation = mean(elevation, na.rm = TRUE))
We might want to aggregate data to geometries that are not congruent (i.e. their boundaries don’t line up). This causes issues when we think about how to summarize associated values.
head(incongruent)
## Simple feature collection with 6 features and 2 fields
## Attribute-geometry relationship: 0 constant, 1 aggregate, 0 identity, 1 NA's
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: 418580 ymin: 443703 xmax: 422963 ymax: 446978
## Projected CRS: OSGB 1936 / British National Grid
## level value geometry
## 1 Incongruent 4.037919 MULTIPOLYGON (((420799.6 44...
## 2 Incongruent 5.014419 MULTIPOLYGON (((418664 4464...
## 3 Incongruent 4.933000 MULTIPOLYGON (((419964 4462...
## 4 Incongruent 5.120139 MULTIPOLYGON (((420368 4441...
## 5 Incongruent 6.548912 MULTIPOLYGON (((420419.8 44...
## 6 Incongruent 3.749791 MULTIPOLYGON (((421779 4451...
head(aggregating_zones)
## Simple feature collection with 2 features and 3 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: 417686.2 ymin: 443703.6 xmax: 422959.3 ymax: 447036.8
## Projected CRS: OSGB 1936 / British National Grid
## geo_code geo_label geo_labelw geometry
## 5164 E02002332 Leeds 003 <NA> MULTIPOLYGON (((418731.9 44...
## 6631 E02002333 Leeds 004 <NA> MULTIPOLYGON (((419196.4 44...
tm_shape(incongruent) +
tm_polygons() +
tm_shape(aggregating_zones) +
tm_borders(col = "red")
The simplest method for dealing with this is using area weighted spatial interpolation which transfers values from the ‘incongruent’ object to a new column in ‘aggregating_zones’ in proportion with the area of overlap.
iv <- incongruent["value"]
agg_aw <- st_interpolate_aw(iv, aggregating_zones, extensive = TRUE) #aggregating zones are what we are aggregating to
## Warning in st_interpolate_aw.sf(iv, aggregating_zones, extensive = TRUE):
## st_interpolate_aw assumes attributes are constant or uniform over areas of x
tm_shape(agg_aw) +
tm_fill(col = "value")
While topological relationships are binary (features either intersect or don’t), distance relationships are continuous.
We can use the following to find the distance between the highest point in NZ and the centroid of the Canterbury region.
nz_highest <- nz_height %>%
slice_max(n = 1, order_by = elevation) #this will give us the highest point in new zealand
canterbury <- nz %>%
filter(Name == "Canterbury")
#finding the centroid of canterbury
canterbury_centroid = st_centroid(canterbury)
## Warning in st_centroid.sf(canterbury): st_centroid assumes attributes are
## constant over geometries of x
#finding the distance
st_distance(nz_highest, canterbury_centroid)
## Units: [m]
## [,1]
## [1,] 115540
#Note: this function returns distances with units (yay!) and as a matrix, meaning we could find the distance between many locations at once.
Simplification generalizes vector data (polygons and lines) to assist with plotting and reducing the amount of memory, disk space, and network bandwidth to handle a dataset.
Let’s try simplifying the US states using the Douglas-Peucker algorithm. GEOS assumes a projected CRS, so we first need to project the data, in this case into the US National Atlas Equal Area (epsg = 2163)
us_states2163 <- st_transform(us_states, "EPSG:2163")
us_states2163 = us_states2163
us_states_simp1 = st_simplify(us_states2163, dTolerance = 100000)
tm_shape(us_states_simp1) +
tm_polygons()
## Warning: The shape us_states_simp1 contains empty units (after reprojection).
# proportion of points to retain (0-1; default 0.05)
us_states_simp2 = rmapshaper::ms_simplify(us_states2163, keep = 0.01, keep_shapes = TRUE) #this keeps the shapes/connections between the states
## Warning in ms_de_unit(input): Coercing these 'units' columns to class numeric:
## AREA
#mapping this
tm_shape(us_states_simp2) +
tm_polygons()
Instead of simplifying, we could try smoothing using Gaussian kernel regression.
us_states_simp3 = smoothr::smooth(us_states2163, method = 'ksmooth', smoothness = 6)
tm_shape(us_states_simp3) +
tm_polygons()
Centroids identify the center of a spatial feature. Similar to taking an average, there are many ways to compute a centroid. The most common is the geographic centroid.
nz_centroid <- st_centroid(nz)
## Warning in st_centroid.sf(nz): st_centroid assumes attributes are constant over
## geometries of x
tm_shape(nz) +
tm_fill() +
tm_shape(nz_centroid) +
tm_dots()
Sometimes centroids fall outside of the boundaries of the objects they were created from. In the case where we need them to fall inside of the feature, we can use point on surface methods.
nz_pos <- st_point_on_surface(nz)
## Warning in st_point_on_surface.sf(nz): st_point_on_surface assumes attributes
## are constant over geometries of x
tm_shape(nz) +
tm_fill() +
tm_shape(nz_centroid) +
tm_dots() +
tm_shape(nz_pos) +
tm_dots(col = "red")
Buffers create polygons representing a set distance from a feature.
seine_buffer <- st_buffer(seine, dist = 5000)
tm_shape(seine_buffer) +
tm_polygons()
As we saw in the last lab, we can spatially aggregate without explicitly asking R to do so.
w <- tm_shape(world) +
tm_polygons()
continents <- world %>%
group_by(continent) %>%
summarize(population = sum(pop, na.rm = TRUE))
c <- tm_shape(continents) +
tm_polygons()
tmap_arrange(w, c)
What is going on here? Behind the scenes, summarize() is using st_union() to dissolve the boundaries.
us_west <- us_states %>%
filter(REGION == "West")
us_west_union <- st_union(us_west)
tm_shape(us_west_union) +
tm_polygons()
st_union() can also take 2 geometries and unite them.
texas <- us_states %>%
filter(NAME == "Texas")
texas_union = st_union(us_west_union, texas)
tm_shape(texas_union) +
tm_polygons()